Abstract

The field of digital preservation is being defined 
by a set of standards developed top-down, starting 
with an abstract reference model (OAIS) and grad- 
ually adding more specific detail. Systems claiming 
conformance to these standards are entering produc- 
tion use. Work is underway to certify that systems 
conform to requirements derived from OAIS. 

We complement these requirements derived 
top-down by presenting an alternate, bottom-up 
view of the field. The fundamental goal of these sys- 
tems is to ensure that the information they contain 
remains accessible for the long term. We develop 
a parallel set of requirements based on observations 
of how existing systems handle this task, and on 
an analysis of the threats to achieving the goal. On 
this basis we suggest disclosures that systems should 
provide as to how they satisfy their goals. 

$Revision: 1.31 $

1 Introduction 

The field of digital preservation systems has 
been defined by the Open Archival Information Sys- 
tem (OAIS) standard ISO 14721:2003 HJ, which 
provides a high-level reference model. This model 
has been very useful. It identifies the participants, 
describes their roles and responsibilities, and classi- 
fies the types of information they exchange. How- 
ever, because it is only a high-level reference model, 
almost any system capable of storing and retrieving 
data can make a plausible case that it satisfies the 
OAIS conformance requirements. 

Work is under way to elaborate the OAIS refer- 
ence model with sufficient detail to allow systems to 
be certified by an ISO 9000-like process jSJ, and to
allow systems to inter-operate on the basis of com- 
mon specifications for ingesting and disseminating 
information [J5J|^. In the same way that ISO 14721
was developed top-down, these efforts are also
top-down.

Several digital preservation systems are in, or 
about to enter, production use preserving content 
society deems important. It seems an opportune 
moment to complement the OAIS top-down effort 
to generate requirements for such systems with a 
bottom-up approach. We start by identifying the 
goal the systems are intended to achieve, and then 
analyze the spectrum of threats that might prevent 
them doing so. We list the strategies that systems 
can adopt to counter these threats, providing ex- 
amples from some current systems showing how the 
strategies can be implemented. We observe that cur- 
rent systems vary in the set of threats they consider 
important, in the strategies they choose to imple- 
ment, and in the ways in which they implement 
them. Given this, and the relatively short experi- 
ence base on which to draw conclusions as to which 
approaches work better, we agree with the top-down 
proponents that setting requirements explicitly or 
implicitly mandating specific technical approaches 
seems imprudent. This paper presents a list of dis- 
closures that we suggest should form part of the 
basis for comparing and certifying systems in the 
medium term. 

We draw on our six years' experience developing
and deploying the LOCKSS 1 digital preservation
system [35], and discussing it with librarians and
publishers. Our descriptions of other systems are
based on published materials and past discussions 
with their implementors. Although a comprehen- 
sive survey of current digital preservation systems 
would be useful, this paper does not attempt such a 
survey. We refer to systems simply to demonstrate 
that particular techniques are currently in use and 
do not attempt to list all systems using them. Note
that we believe all the systems to which we refer
satisfy the conformance requirements of ISO 14721.

1 LOCKSS is a trademark of Stanford University. It stands
for Lots Of Copies Keep Stuff Safe.

2 Goal 

The goal of a digital preservation system is 
that the information it contains remains accessible 
to users over a long period of time. 

The key problem in the design of such systems 
is that the period of time is very long, much longer 
than the lifetime of individual storage media, hard- 
ware and software components, and the formats in 
which the information is encoded. If the period were 
shorter, it would be simple to satisfy the require- 
ment by storing the information on suitably long- 
lived media embedded in a system of similarly long- 
lived hardware and software. 

No media, hardware or software exists in whose 
longevity designers can place such confidence. They 
must therefore anticipate failures and obsolescence, 
designing systems with three key properties: 

• At minimum, the system must have no single 
point of failure; it must tolerate the failure of 
any individual component (see Section 4.1). In
general, systems should be designed to tolerate 
more than one simultaneous failure. 

• Media, software and hardware must flow 
through the system over time as they fail or 
become obsolete, and are replaced. The system 
must support diversity among its components 
to avoid monoculture vulnerabilities, to allow 
for incremental replacement, and to avoid vendor
lock-in (see Section 4.4).

• Most data items in an archive are accessed in- 
frequently. A system that detected errors and 
failures only upon user access would be vulnerable
to an accumulation of latent errors [53]. The
system must provide for regular audits at inter- 
vals frequent enough to keep the probability of 
failure at acceptable levels (see Section 4.5).

The major contrast between the top-down and 
bottom-up approaches can be summed up as being 
between the figure and the ground in a view of the 
system. The top-down approach naturally focuses 
on what the system should do, in terms of exchang- 
ing this kind of data and this kind of meta-data with 
these types of participant. The bottom-up approach,
by contrast, naturally focuses on what the system
should not do, in terms of losing data or delaying
access under specific types of failures. Both views
have value to system designers.

3 Threats 

We concur with the recent National Re- 
search Council recommendations to the National 
Archives jM] that the designers of a digital preserva- 
tion system need a clear vision of the threats against 
which they are being asked to protect their system's 
contents, and those threats under which it is accept- 
able for preservation to fail. Note that many of these 
threats are not unique to digital preservation sys- 
tems, but their specific mission and very long time 
horizons incline such systems to view the threats dif- 
ferently from more conventional systems. 

To assist in the development of these threat 
models, we present the following taxonomy of 
threats. Threat models should either include or
explicitly exclude at least these threats:

Media Failure All storage media must be ex- 
pected to degrade with time, causing irrecov- 
erable bit errors, and to be subject to sudden 
catastrophic irrecoverable loss of bulk data such 
as disk crashes [HI] or loss of off-line media.

Hardware Failure All hardware components 
must be expected to suffer transient re- 
coverable failures, such as power loss, and 
catastrophic irrecoverable failures, such as 
burnt-out power supplies. 

Software Failure All software components must 
be expected to suffer from bugs that pose a risk 
to the stored data. 

Communication Errors Systems cannot assume 
that the network transfers they use to ingest or 
disseminate content will either succeed or fail 
within a specified time period, or will actually 
deliver the content unaltered. A recent study 
"suggests that between one (data) packet in ev- 
ery 16 million packets and one packet in 10 bil- 
lion packets will have an undetected checksum 
error".

Failure of Network Services Systems must an- 
ticipate that the external network services they 
use, including resolvers such as those for domain 
names and persistent URLs 03] , will suffer 
both transient and irrecoverable failures both of 
the network services and of individual entries in 
them. As examples, domain names will vanish
or be reassigned if the registrant fails to pay the
registrar, and a persistent URL will fail to re- 
solve if the resolver service fails to preserve its 
data with as much care as the digital preserva- 
tion service. 

Media & Hardware Obsolescence All media
and hardware components will eventually 
fail. Before that, they may become obsolete 
in the sense of no longer being capable of 
communicating with other system components 
or being replaced when they do fail. This 
problem is particularly acute for removable 
media, which have a long history of remaining 
theoretically readable if only a suitable reader 
could be found [77].

Software Obsolescence Similarly, software com- 
ponents will become obsolete. This will often 
be manifested as format obsolescence when, al- 
though the bits in which some data was encoded 
remain accessible, the information can no longer 
be decoded from the storage format into a legi- 
ble form. 

Operator Error Operator actions must be ex- 
pected to include both recoverable and irrecov- 
erable errors. This applies not merely to the 
digital preservation application itself, but also 
to the operating system on which it is running, 
the other applications sharing the same envi- 
ronment, the hardware underlying them, and 
the network through which they communicate. 

Natural Disaster Natural disasters, such as 
flood [37], fire and earthquake, must be anticipated.
They will typically be manifested by
other types of threat, such as media, hardware 
and infrastructure failures. 

External Attack Paper libraries and archives are 
subject to malicious attack [20]; there is no rea- 
son to expect their digital equivalents to be ex- 
empt. Worse, all systems connected to public 
networks are vulnerable to viruses and worms. 
Digital preservation systems must either defend 
against the inevitable attacks, or be completely
isolated from external networks.

Internal Attack Much abuse of computer systems 
involves insiders, those who have or used to have 
authorized access to the system. Even if
a digital preservation system is completely iso- 
lated from external networks, it must anticipate 
insider abuse. 
Economic Failure Information in digital form is
much more vulnerable to interruptions in the 
money supply than information on paper. 
There are ongoing costs for power, cooling, 
bandwidth, system administration, domain reg- 
istration, and so on. Budgets for digital preser- 
vation must be expected to vary up and down, 
possibly even to zero, over time. 

Organizational Failure The system view of dig- 
ital preservation must include not merely the 
technology but the organization in which it is 
embedded. These organizations may die out, 
perhaps through bankruptcy, or change mis- 
sions. This may deprive the digital preservation 
technology of the support it needs to survive. 
System planning must envisage the possibility 
of the asset represented by the preserved con- 
tent being transferred to a successor organiza- 
tion, or otherwise being properly disposed of. 
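To give a sense of scale for the undetected checksum error rates quoted under Communication Errors, the following sketch estimates how many silently corrupted packets a bulk transfer might deliver. The 1 TB transfer size and the 1460-byte payload are our illustrative assumptions, not figures from the study quoted.

```python
def expected_undetected_errors(total_bytes: int, payload_bytes: int,
                               error_rate: float) -> float:
    """Expected number of packets delivered with an undetected error."""
    packets = -(-total_bytes // payload_bytes)  # ceiling division
    return packets * error_rate


# Illustrative assumptions: a 1 TB transfer carried in 1460-byte payloads.
TRANSFER_BYTES = 10**12
PAYLOAD_BYTES = 1460

# The two ends of the range quoted in the text.
worst = expected_undetected_errors(TRANSFER_BYTES, PAYLOAD_BYTES, 1 / 16e6)
best = expected_undetected_errors(TRANSFER_BYTES, PAYLOAD_BYTES, 1 / 10e9)
print(f"expected undetected errors: {best:.3f} to {worst:.1f}")
```

Under these assumptions a single terabyte transfer would be expected to carry somewhere between a fraction of one and a few dozen corrupted packets that the network checksums miss, which is why ingest and replication transfers need their own end-to-end verification.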

For each of these types of failure, it is necessary 
to trade off the cost of defense against the level of 
system degradation under the threat that is regarded 
as acceptable for that cost. The degradation may be 
evaluated in terms of the following questions: 

• What fraction of the system's content is irrecov- 
erably lost? 

• What fraction of the user population suffers 
what delay in accessing the impaired but re- 
coverable fraction of the system's content? 

Designers should be aware that these threats 
are likely to be highly correlated. For example, op- 
erators stressed by responding to one threat, such 
as hardware failure or natural disaster, are far more 
likely to make mistakes than they are when things 
are calm jJS]. Equally, software failures are likely to 
be triggered by hardware failures which present the 
software with conditions its designers failed to antici- 
pate and under which it has never been tested. Mean 
Time Between Failure estimates are typically based 
on the assumption that failures occur independently 
(e.g. @1>1); even small correlations between the fail- 
ures can render the estimates wildly optimistic. 
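The effect of correlation on such estimates can be illustrated with a deliberately simple model. In this sketch, which is our assumption rather than a model taken from the cited work, each replica fails independently with probability p in some period, and in addition a single common-cause event (for example an operator error applied to every replica) destroys all replicas at once with a small probability.

```python
def loss_probability(p: float, replicas: int,
                     common_cause: float = 0.0) -> float:
    """Probability that every replica is lost in one period.

    p            -- assumed chance that one replica fails on its own
    replicas     -- number of replicas
    common_cause -- assumed chance of a shared event destroying all replicas
    """
    independent = p ** replicas
    return common_cause + (1.0 - common_cause) * independent


independent = loss_probability(0.01, 3)
correlated = loss_probability(0.01, 3, common_cause=0.0001)
print(f"independent failures only: {independent:.2e}")
print(f"with common-cause event:   {correlated:.2e}")
print(f"ratio:                     {correlated / independent:.0f}x")
```

With these assumed figures, a common-cause probability of only one in ten thousand raises the chance of losing all three replicas by about two orders of magnitude over the independent estimate.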

4 Strategies 

We now survey the strategies that system de- 
signers can employ to survive these threats. 



4.1 Replication 

The most basic strategy exploits the funda- 
mental attribute that distinguishes digital from ana- 
log information, the possibility of copying it without 
loss of information, to store multiple replicas of the 
information to be preserved. Clearly, a single replica 
subject to the threats above has a low probability 
of long-term survival, so replication is a necessary 
attribute of a digital preservation system but it is 
far from sufficient, as anyone who has had trouble 
restoring a file from a backup copy can appreciate. 

Examples of a common approach to replica- 
tion among current digital preservation systems are 
Florida's DAITSS (HHj and the system under devel- 
opment at the British Library (BL) 0. Both use a 
fixed number (3,4) of replicas, automatically creat- 
ing replicas of each submitted item at geographically 
distributed sites. Each site may create further, off- 
line backup replicas. 

An example of a system using dynamic, and
much higher, levels of replication is the LOCKSS
peer-to-peer digital preservation system, in which 
each participating library collects its own copy of 
the information in which it is interested. The level 
of replication for an item is set by the number of li- 
braries that collect it, which ranges in the deployed 
system from 6 to 80 or more. The LOCKSS auditing
process (see Section 4.5) advises peer operators
to establish more replicas when the number their 
peer can locate drops below a preset threshold. 

4.2 Migration 

The creation and management of replicas 
which lies at the base of a digital preservation system 
involves a process of migration, between instances of 
the same type of storage medium, from one medium 
to another, and from one format to another. Migra- 
tions can be exceptional events, handled by the sys- 
tem operators perhaps on a batch basis, or routine 
events, handled automatically by the system with- 
out operator intervention. 

Migration between instances of the same 
medium, for example network transfers from mass 
storage at one site to mass storage at another, is 
typically used to implement replication and to re- 
fresh media. All systems employing replication ap- 
pear to use it. It can be effective against media and 
hardware failures. 

The classic example of migration between media
is tape backup, used by many systems. It can
be effective against media, hardware and software 
failures and obsolescence. 

Format migration can be an effective strategy
to combat software obsolescence. Many systems
do format migration, in some cases preemptively
on ingress (e.g. ANA, the Australian National
Archives 5Hj), in some cases on a batch basis
(e.g. DAITSS) and in some cases on access (e.g.
LOCKSS). Preemptive migration on input is often
called normalization. All such systems of which
we are aware plan to preserve the original format; 
some in addition preserve the result of format mi- 
gration. 

Some systems, for example that at the Konin- 
klijke Bibliotheek (KB, National Library of the 
Netherlands), avoid the issue of format migration by 
accepting information for preservation only in formats
believed to be suitable, in KB's case PDF [73],
and by pursuing a strategy of emulation [23 EH] to 
ensure that the interpreters for the chosen formats 
will remain usable. 

4.3 Transparency 

Digital preservation technology shares some at- 
tributes with encryption technology. Perhaps the 
most important is that in both cases the customer 
has no way to be sure that the system will continue 
to perform its assigned task, of preserving or pre- 
venting access to the system's content as the case 
may be. An encryption system may be broken or 
misused and therefore reveal content. However long 
you watch a digital preservation system, you can 
never be sure it will continue to provide access in 
the future. 

In both cases transparency is key to the cus- 
tomer's confidence in the system. Just as open 
source, open protocols and open interfaces provide 
the basis for the public review that allows customers 
to have confidence in encryption systems such as 
AES 021 j similar reviews based on similar intro- 
spection are needed if customers are to have con- 
fidence that their digital preservation systems will 
succeed. Examples of open-source digital preserva- 
tion systems include the LOCKSS system and MIT's 
DSpace system [62].

An essential precaution against the software of 
a digital preservation system becoming obsolete is 
that it be preserved with at least as much care as 
the information which it is preserving. Open source
makes this easy. Open protocols and open interfaces
are a necessary but not sufficient precondition
for diverse implementations of system components
(see Section 4.4) and for effective "third-party" audit
mechanisms (see Section 4.5.2). We have yet to see
examples of diverse implementations or third-party 
audit in practice. 

Transparency is also the key to the ability to 
perform format migration. Widely used data for- 
mats well-supported by open source interpreters, 
such as the majority of those used on the Web, 
are easy to migrate j^J 07]. Proprietary formats,
particularly those supported by a business model 
that thrives on backwards incompatibility, are much 
harder. The hardest of all are proprietary formats 
entwined with proprietary hardware, such as game 
consoles. 

While there is clearly a role in digital preser- 
vation for proprietary, closed software that imple- 
ments open interfaces or formats, using closed soft- 
ware and proprietary interfaces or formats renders 
the preserved information hostage to the vendor's 
survival and is hard to justify. Transparency in gen- 
eral, and open source in particular, can be an effec- 
tive strategy against all forms of obsolescence. Ac- 
cess to the source encourages wide review of the sys- 
tem for vulnerabilities, which can help prevent at- 
tacks succeeding. Open source can also be effective 
against economic failure, by preventing an organiza- 
tion's financial troubles from dooming the system's 
technology. Open source software will not be viewed 
as an asset to be sold or fall under the control of a 
bankruptcy court. Note that open source software 
may be supported by major companies such as IBM 
and Sun Microsystems, rather than depending upon 
volunteer community support. 

Despite the best efforts of system designers 
and implementors, and despite the certifications ex- 
pected to be available for digital preservation sys- 
tems, data will be lost. To improve the performance 
of systems over time, it is essential that lessons be 
learned from incidents that risk or cause data loss. 
We can expect that such incidents will be infrequent, 
making it important to extract the maximum bene- 
fit from each. Past incidents suggest that an insti- 
tution's reaction to data loss is typically to cover it 
up, preventing the lessons from being learned. This paper
itself illustrates the problem: we have no way to cite
or discuss the details of several incidents of this kind
known to practitioners via the grapevine.
This problem is familiar in aviation safety, and
it has led NASA to establish the Aviation Safety
Reporting System. Through this no-fault reporting
system, incidents can be reported in suitably
anonymized form, allowing the community to learn 
from rare but important failures without penalizing 
those reporting them. A similar system would be of 
great benefit in digital preservation. 

4.4 Diversity 

Systems lacking diversity, in the extreme
monocultures, are vulnerable to catastrophic failure.
Ideally, a digital preservation system should
provide diversity at all levels, but most systems pro- 
vide it at only a few, citing cost considerations: 

• Most systems use off-line media to provide di- 
versity in media for storing replicas, and to 
isolate some replicas as far as possible from 
network-borne threats. 

• Many systems use geographic dispersion of on- 
line replicas to counter threats of natural disas- 
ter (e.g. DAITSS and the BL's system). Most 
systems using off-line backups store them off- 
site, again providing geographic diversity. The 
LOCKSS system has replicas scattered around 
the world. 

• The BL's system is an example of explicit plan- 
ning for diversity in hardware and vendors to 
support a process of "rolling procurement" and
"rolling replacement" 0. The library's continuous
collection program means that the system
must grow incrementally, its availability requirements
mean that replicas must be replaced
incrementally (a sound approach to preventing
correlated administration errors [29]), and its
long planned lifetime means that vendor lock-in
is unacceptable.

• Similar considerations apply to software. There 
should be a diversity of software among the 
replicas. The BL's system anticipates that at 
any one time different replicas will be running 
earlier or later versions of their management 
software, and that the different manufacturers 
of the underlying storage technologies will pro- 
vide some level of software diversity. 

• The BL's and LOCKSS systems are examples of 
diversity of system administration. Each replica 
is independently administered; there is no single
password whose compromise could affect all
replicas. Given the prevalence of human error
and insider abuse of computer systems, unified 
system administration should be an unaccept- 
able feature of digital preservation. 

• The Portico ^3] and LOCKSS systems are
striving for diversity of funding. As regards the 
peers actually storing content the LOCKSS sys- 
tem is already diverse; each peer is owned and 
supported by its host library so no single bud- 
get cut or administrative decision can cause the 
system as a whole to lose content. Portico as a 
whole and the team that supports the LOCKSS 
system are both in the process of transition from 
sole-source grant funding, to support by the li- 
braries using the service. In this model no single 
budget decision would affect more than a few 
percent of the team's total income. 

The risk of inadequate diversity is particularly 
acute for networked computer systems such as dig- 
ital preservation systems. Techniques have been 
available for some years by which an attacker can 
compromise almost all systems sharing a vulnerability
in a very short time [65]. Worms such as Slammer
[39] have used them in the wild. System designers
would be unwise to believe that they can construct,
configure, upgrade, and expand systems for
the long term that are not exploitable in this way.

Replicated systems can prevent attacks result- 
ing in catastrophic failure by arranging that repli- 
cas do not share common implementations and thus 
common vulnerabilities. This approach has been ex- 
plored in a data storage context at UCSD [21], but 
we are not aware of any production digital preser- 
vation systems currently using diversity in a similar 
way. The LOCKSS system is taking its first steps in 
this direction; a version of its network appliance [53]
based on a second operating system is under devel- 
opment. 

4.5 Audit 

Most data items in digital preservation sys- 
tems are, by their archival nature, rarely accessed 
by users. Although in aggregate systems such as 
the Internet Archive may satisfy a large demand 
from users, the average interval between successive 
accesses to any individual item in the archive is 
long(S2I. Many systems (e.g. DAITSS) are designed 
as dark archives which envisage user access only if 
exceptional circumstances render a separate access 
replica unavailable. Similarly, systems providing deposit
for copyright purposes, e.g. at the British Library
|23, and the KB ^J, often reassure publishers
that deposit is not a risk to their business models by 
placing severe restrictions on access to the deposited 
copy. For example, they may provide access only to 
readers physically at the library. 

Because users access the typical preserved data 
item very infrequently, the system cannot rely on 
user accesses to detect, and thus trigger the response 
to, errors and failures. Provision must therefore be 
made for regular audits at sufficiently frequent in- 
tervals to keep the probability of failure at accept- 
able levels. Errors and failures in system compo- 
nents may be latent @B|, that is they may only re- 
veal themselves long after they occur. Even if user 
detection of corruption and failure were reliable, it 
would not happen in time to prevent loss. 
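The dependence of loss probability on audit frequency can be sketched with a toy model. The assumptions here are ours and purely illustrative: each replica suffers latent faults independently at a constant rate, every audit finds and repairs all damaged replicas from a surviving one, and an item is lost only if all of its replicas are damaged within the same audit interval. The fault rate of 0.1 per replica per year is an invented figure.

```python
import math


def loss_probability(horizon_years: float, fault_rate: float,
                     audit_interval_years: float, replicas: int = 2) -> float:
    """Chance of losing an item within the horizon under the toy model."""
    # Chance that one replica is damaged at some point within an interval.
    p_fault = 1.0 - math.exp(-fault_rate * audit_interval_years)
    # The item is lost if every replica is damaged before the next audit.
    p_loss_per_interval = p_fault ** replicas
    intervals = horizon_years / audit_interval_years
    return 1.0 - (1.0 - p_loss_per_interval) ** intervals


yearly = loss_probability(100, fault_rate=0.1, audit_interval_years=1.0)
monthly = loss_probability(100, fault_rate=0.1, audit_interval_years=1 / 12)
print(f"audit yearly:  {yearly:.3f}")
print(f"audit monthly: {monthly:.3f}")
```

Even in this simple model, shortening the audit interval sharply reduces the chance of loss over a century. Because the model assumes independent faults, which Section 3 argues is optimistic, real systems should expect to do worse than it predicts.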

A second reason why audit mechanisms matter is
that digital preservation systems are typically
expensive to operate, and subject to a high risk
of economic failure. Different systems
have different business models, but they fall into two 
broad groups: 

First-party systems which store information be- 
longing to the organization operating the sys- 
tem. Examples are internal corporate sys- 
tems, state archives such as DAITSS, the US 
National Archives and Records Administration
(NARA), and the UK Public Record Office,
and LOCKSS peers, which operate like a pa- 
per library storing a "purchased" copy of the 
content. 

Third-party systems which provide their cus- 
tomers or the taxpayers the service of holding 
information belonging to a publisher. Examples 
are national library copyright deposit systems, 
and archives such as JSTOR and Portico. 

Whether the funds come from internal sources, 
the taxpayers, or customers of the digital preserva- 
tion service, the funders will require evidence that, 
in return for their funds, the service of providing ac- 
cess is actually being provided. Audit mechanisms 
are the means by which this assurance can be pro- 
vided, and the risk of economic failure mitigated. 

4.5.1 Audit During Ingest 

Some digital preservation systems (e.g. 
DSpace and the KB's system) ingest content using 



a push model in which the content publisher takes 
action to deposit it in the system, whereas others 
(e.g. the Internet Archive 19 and the LOCKSS 
system) ingest using a pull model; they crawl the 
publisher's web sites to ingest content. 

Neither process is immune from the threats 
outlined above. Some form of auditing must be 
used to confirm the authenticity of the ingested con- 
tent [HI- 

The LOCKSS ingest process is driven by a 
Submission Information Package (SIP, in OAIS terminology).
The SIP can come in one of two forms,
an OAI-PMH [72] query asking for all new content
since the last crawl, or a manifest page that points 
to starting points for a targeted crawl. Neither di- 
rectly specifies all the files to be collected (for exam- 
ple, the OAI-PMH query may miss preexisting files 
that new files include by reference). In each case 
further crawling is required to ensure that the con- 
tent unit is complete. Each peer collects its replica 
independently; network and server problems may 
cause the set of collected URLs to differ between 
the peers. The normal LOCKSS audit process (see 
Section 4.5.3) serves to find the differences, and resolve
them by re-crawling the publisher's web site.
The peers thus arrive at a consensus as to what the 
publisher is publishing 2 . 
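The reconciliation step described above can be sketched as a comparison of the URL sets that independent crawls produced. This is a simplification for illustration only: the LOCKSS audit compares message digests rather than raw URL lists, and the function name, peer names and URLs below are hypothetical.

```python
def urls_to_recrawl(collected: dict[str, set[str]]) -> dict[str, set[str]]:
    """For each peer, the URLs some other peer collected but it did not."""
    seen_anywhere = set().union(*collected.values())
    return {peer: seen_anywhere - urls for peer, urls in collected.items()}


# Hypothetical peers and URLs, for illustration only.
collected = {
    "peer-a": {"http://example.org/v1/toc", "http://example.org/v1/a1"},
    "peer-b": {"http://example.org/v1/toc", "http://example.org/v1/a1",
               "http://example.org/v1/fig1.png"},
}
for peer, missing in urls_to_recrawl(collected).items():
    print(peer, sorted(missing))
```

A peer with a non-empty result would re-crawl the publisher's site for the URLs listed, after which the peers' collections converge.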

4.5.2 Third-party Audit 

One common approach to audit involves re- 
trieving a sample of the system's content, computing 
a message digest of the retrieved content, and com- 
paring it with a message digest of the same content 
computed earlier and preserved in some way other 
than in the preservation system. If the previous mes- 
sage digest is computed over the entire item submit- 
ted (the SIP) and thus includes the metadata, and if 
the system is capable of retrieving the entire SIP as 
part or all of a Dissemination Information Package 
(DIP), this has the attractive property of being an 
end-to-end validation of the system's performance. 
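A minimal sketch of this style of audit, using Python's standard hashlib module, is shown below. The choice of SHA-256 and the function names are our assumptions; the approach itself does not depend on a particular digest algorithm.

```python
import hashlib


def compute_digest(data: bytes, algorithm: str = "sha256") -> str:
    """Hex digest of the data under the named algorithm."""
    return hashlib.new(algorithm, data).hexdigest()


def audit_item(retrieved: bytes, recorded_digest: str,
               algorithm: str = "sha256") -> bool:
    """True if the retrieved item still matches the digest recorded earlier."""
    return compute_digest(retrieved, algorithm) == recorded_digest


sip = b"example submitted item, including its metadata"
recorded = compute_digest(sip)              # kept outside the system at ingest
print(audit_item(sip, recorded))            # retrieved copy is intact
print(audit_item(sip + b"\x00", recorded))  # retrieved copy is damaged
```

Note that a True result shows only that the retrieved bytes and the recorded digest agree with each other, a limitation that matters for the audit's reliability.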

2 This currently limits the system to preserving content
that is published once and thereafter remains unchanged.
Work is underway to lift this restriction, although preserving
sites such as BBC News "updated every minute of every
day" [1] with complete fidelity will remain beyond the state
of the art.

There are a number of problems with this approach.
The first is that the content and its previous
digest are both bit strings. A mismatch between the
current digest and the previous one means that either
the content or the digest or both are corrupt.
Both bit strings must be preserved between audits. 
Ideally the digests must be preserved by some means 
of digital preservation other than the system being 
audited 3 . Admittedly, they are much smaller than 
the SIPs and DIPs, but there is at least one for ev- 
ery SIP. A large system cannot, therefore, rely on 
techniques such as printing them on acid-free paper; 
locating and transcribing the digests would be too 
time-consuming and error-prone. The digests need 
to be stored in a database that can be queried dur- 
ing the audit. In the limit this sets up an infinite 
regress of preservation systems. So-called "entangle- 
ment" protocols can be used to mitigate this risk 
by preventing attackers re-writing history to change 
previous message digests without detection, but they 
are complex and have yet to be deployed in practice. 

Other problems are identified by Henson [17].
The most important is that, while disagreement be- 
tween the current and previous digests gives a very 
strong presumption that either the content or the 
digest is corrupt, agreement between them gives a 
much weaker presumption that they are unchanged. 
This is not just because an attacker might have been 
able to change both the content and the digest, but 
also because digest algorithms are inherently subject 
to collisions, in which two different inputs generate 
the same digest. Digest algorithms are designed to 
make collisions unlikely, but some of the assump- 
tions underlying these designs do not hold in digital 
preservation applications. For example, the analysis 
of the algorithm normally assumes that the input is 
a random string of bits, which for digital preserva- 
tion is unlikely. 

Another is that, like encryption algorithms, 
over time message digest algorithms become vul- 
nerable. Recently, for example, the widely used 
MD5 [23 and SHA1 [HI algorithms appear to have 
been broken. Breaks like these are, in effect, latent 
errors because there may be considerable delay be- 
tween the actual break and knowledge of it becom- 
ing public. During this time auditing with message 
digests may be ineffective. 

3 Storing the previous message digests in the same system
can be useful, but it does not protect against operator error,
external attack, internal attack or software failures.

If a digital preservation system audits against
previous message digests it must preemptively,
before the current algorithm is broken, replace it. To
do so, it should audit against the current digest to
confirm that the item is still good, then compute a
digest using the replacement algorithm. This should
be appended to the stored list of digests for the item. 
DAITSS is an example of a system auditing against 
previous digests stored in the system itself; it uses 
two different algorithms in parallel to increase the 
reliability of audit and reduce the risk from broken 
algorithms. 
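
The replacement procedure described above can be sketched as follows. This is a minimal illustration, not the method of DAITSS or any other system named here: the function names, the list-of-pairs representation of the stored digests, and the choice of SHA-1 and SHA-256 as the "current" and "replacement" algorithms are all assumptions made for the example.

```python
import hashlib


def digest(data: bytes, algorithm: str) -> str:
    """Return the hex digest of data under the named hashlib algorithm."""
    return hashlib.new(algorithm, data).hexdigest()


def migrate_digest(data, stored, new_algorithm):
    """Audit data against the most recent stored digest, then append a
    digest computed with new_algorithm.

    stored is a list of (algorithm, hex_digest) pairs, oldest first
    (a hypothetical representation of the item's digest list).
    Raises ValueError if the audit against the current digest fails.
    """
    current_algorithm, current_digest = stored[-1]
    if digest(data, current_algorithm) != current_digest:
        raise ValueError("audit failed: content or stored digest is corrupt")
    # The item is confirmed good under the old algorithm, so it is safe
    # to vouch for it under the replacement algorithm.
    return stored + [(new_algorithm, digest(data, new_algorithm))]


content = b"preserved item"
history = [("sha1", digest(content, "sha1"))]
history = migrate_digest(content, history, "sha256")
```

The old digest is kept rather than overwritten, so the item carries the whole list of digests; a mismatch at the audit step stops the migration before a digest of possibly corrupt content is recorded.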

Note that the result of format migration will 
have a different digest from the original and, if it 
is itself preserved, must have its own stored list of 
digests. This is another reason why all systems we 
have found preserve the original in addition to (e.g. 
ANA) or instead of (e.g. LOCKSS) the migrated 
version (see Section FOll . 

4.5.3 Mutual Audit 

An alternate approach to audit that is not sub- 
ject to the risks of previous message digests is used 
in the LOCKSS system. Instead of trying to prove 
that an individual replica is unchanged since the pre- 
vious digest, the LOCKSS audit mechanism proves 
at regular intervals that the replica agrees with the 
consensus of the replicas at peer libraries. An at- 
tacker seeking to change the content has, therefore, 
to change the vast majority of the replicas within 
a short period. If there are a sufficient number of 
independent replicas this can be made very hard to 
do, especially in the face of the system's internal
intrusion detection measures.

The LOCKSS audit involves peers computing 
and exchanging message digests; they do not have 
to reveal their content to the auditor. This has dis- 
advantages, in that it is not an end-to-end audit, 
and advantages, in that it prevents the audit mech- 
anism being a channel by which content could leak to 
unauthorized readers. A peer whose content doesn't 
match the consensus of the peers can repair it from 
the original publisher, if it is still available, or from 
other peers. 
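
The idea of auditing a replica against the consensus of its peers by exchanging only digests can be illustrated with a simple majority vote. This is a deliberately simplified sketch and not the LOCKSS protocol, which is considerably more elaborate; the function name, the peer names, the use of SHA-256 and the simple-majority quorum are all assumptions made for the example.

```python
import hashlib
from collections import Counter
from typing import Dict, List, Optional, Tuple


def audit_against_peers(
    replicas: Dict[str, bytes], quorum: float = 0.5
) -> Tuple[Optional[str], List[str]]:
    """Return (consensus_digest, peers_needing_repair).

    Only digests are compared; no peer's content is revealed to another.
    If no digest is held by more than the quorum fraction of peers, the
    consensus is None and no repairs are suggested.
    """
    digests = {
        peer: hashlib.sha256(content).hexdigest()
        for peer, content in replicas.items()
    }
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes <= quorum * len(digests):
        return None, []
    damaged = sorted(peer for peer, d in digests.items() if d != winner)
    return winner, damaged


# Hypothetical peers; one replica has suffered damage.
replicas = {
    "lib_a": b"article",
    "lib_b": b"article",
    "lib_c": b"articlX",
    "lib_d": b"article",
}
consensus, damaged = audit_against_peers(replicas)
```

In this example `damaged` names the one peer that disagrees with the majority, which would then repair its copy from the publisher or from another peer.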

This mechanism does not depend on anything 
but the content itself being preserved for the long 
term, and is less at risk if the message digest algo- 
rithm is broken. Nevertheless, a system that used 
both forms of audit would be more resistant to loss 
and damage than either alone; the advantages of 
adding previous message digests to the LOCKSS sys- 
tem are outlined in [7]. 

This mechanism also has implications for for- 
mat migration. Obviously, once the peers have 
reached consensus about the information ingested, 
it can ensure that this consensus is preserved. Now,
suppose that format migration becomes necessary. 
Consensus must be re-established on the migrated 
version before the mechanism can be applied to it. 
This is one reason why the LOCKSS system pre- 
serves only the original. 

4.6 Economy 

Techniques for reducing the cost of systems are 
always valuable, but they are especially valuable for 
digital preservation systems. Few if any institutions 
have an adequate budget for digital preservation; 
they must practice some form of economic triage. 
They will preserve less content than they should, or 
take greater risks with it, to meet the budget con- 
straints. Reduced costs of acquiring and operating 
the system flow directly into some combination of 
more content being preserved or lower risk to the 
preserved content. 

We discuss cost reduction at each of the stages 
of digital preservation, ingesting the content, pre- 
serving it, and disseminating it to the eventual read- 
ers. At each stage we identify a set of cost compo- 
nents, not all of which are applicable to all systems. 

4.6.1 Economy in Ingest 

The cost of ingesting the content has three 
components: the cost of obtaining permission to pre- 
serve the content, the cost of actually ingesting the 
content, and the cost of creating and ingesting any 
associated metadata. 

Obtaining Permission 

Under the US Digital Millennium Copyright 
Act and similar legislation overseas, permission 
from the copyright owner is required to make and 
preserve copies of copyright material. This applies 
equally to open access, subscription and pay-per- 
view content. Some digital preservation systems, in- 
cluding internal corporate systems and University 
institutional repositories such as DSpace, are in- 
tended to preserve content whose copyright is owned 
by the host institution. They can thus assume per- 
mission, although obtaining explicit confirmation of 
this for each item ingested might be worth the cost. 

The Internet Archive takes the approach that
"to ask permission is to court denial", collecting and
preserving copyright content without obtaining
permission. The downside of this approach is that, if
any copyright owner objects, the material in question
must be immediately removed, which is not a viable
policy for important curated collections. Other systems
must obtain and preserve a record of their permission
to preserve.

Negotiating and obtaining this permission can 
be difficult, time-consuming and expensive. Copy- 
right deposit systems have an established legal 
framework in which to operate [2] , and legal incen- 
tives for publishers to cooperate, which can greatly 
reduce costs. Other systems must negotiate individ- 
ually with each publisher. The costs of doing so have 
been identified as a major impediment to preserva- 
tion of electronic journals [Hj- 

The LOCKSS system allows access to each 
replica only to the host institution's readers and is 
thus able to use a simple one-paragraph addition 
to the publisher's existing license terms. Experi- 
ence so far has shown the cost of negotiating permis- 
sion to be manageable for larger publishers, where 
one negotiation covers many journals, but a signif- 
icant problem for smaller single-journal publishers, 
such as those being selected for preservation by the 
LOCKSS Humanities Initiative [32]. If it is necessary
for each and every journal, even a very cheap
and easy negotiation gets expensive. Wider adop- 
tion of the Creative Commons license [T2\, which 
provides the permission needed for preservation and 
thus eliminates negotiation, could greatly reduce the 
cost of preservation. 

Ingesting Content 

Just as with obtaining permission, if the inges- 
tion of content and any necessary audit to establish 
its authenticity can be automated the per-item cost 
of ingestion will normally be insignificant. To the 
extent to which humans are involved in the ingest 
process the cost of the process can be very signifi- 
cant. 

Ingesting Metadata 

Much of the discussion of digital preservation 
has focused on the metadata rather than the content 
itself, for example on what metadata should be pre- 
served along with the content EI EH EH , and 
on standards for representing it E21 • There has 
been less focus on where it comes from, and on the 
impacts the costs of creating, validating and pre- 
serving it can have on the overall economics of the 
system. 

To the extent to which metadata, especially 
format and bibliographic metadata, can be supplied 
by the original creator of the content, or extracted
automatically from the content itself ^3E], the cost
impact will be low. To the extent to which content 
must be elaborated by hand with metadata, the cost 
impact will be significant. The trade-off between 
preserving more content, and providing better qual- 
ity of metadata for the content that is preserved, can 
be very sharp. 

It should be noted that the value of hand- 
generated format metadata in assisting format mi- 
gration has yet to be demonstrated in produc- 
tion systems, and that in an era when access 
via search is dominant the value of even high- 
quality bibliographic metadata is suspect. Re- 
quirements for hand-generated metadata not clearly 
based in an intended use can easily become
counter-productive [61].

The ingestion workflow implemented by 
DSpace collects hand-generated, high-quality meta- 
data from the submitter of the content [HI]. But 
the complexity this adds to the ingest process has 
caused some resistance to its adoption [75] OH]. The
LOCKSS system's ingest process is completely auto- 
mated, collecting only the metadata provided by the 
publisher in its web pages. This has been criticized 
as inadequate ^UJ; systems such as CiteSeer [5] and
Google Scholar have shown that automatic extraction
of metadata can be effective, but this technology
has yet to be incorporated.

4.6.2 Economy in Preservation 

The cost of preserving the content and its as- 
sociated metadata has three components: the cost 
of acquiring and continually replacing the necessary 
hardware and software, operational costs such as 
power, cooling, bandwidth, staff time and the au- 
dits needed to assure funders that they are getting 
their money's worth, and the cost of the necessary 
format migrations. 

Systems with few replicas have to be very 
careful with each of them, using very reliable 
enterprise-grade storage hardware and expensive off- 
line backup procedures. Systems with many replicas 
can be less careful with each of them, for example 
using consumer-grade hardware and depending on 
other replicas to repair damage rather than using 
off-line backups. Our experience is that the
per-replica cost can in this way be reduced enough to
outweigh the increased number of replicas. 

Storage 



The economics of high-volume manufacturing
mean that consumer-grade disk drives are vastly 
cheaper and only a little less reliable than enterprise- 
grade drives. Based on Seagate's Mean Time To 
Failure (MTTF) specifications, a 200GB consumer 
Barracuda drive has a 7% probability of failing in 
a 5-year service life [55], whereas a 146GB enterprise
Cheetah has a 3% probability of failing [59]. But
the Cheetah costs about $8.20/GB whereas the
Barracuda costs only about $0.57/GB (footnote 4).
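
The trade-off between per-replica reliability and per-replica cost can be made concrete with the figures quoted above. The following sketch assumes, purely for illustration, that replica failures are statistically independent and that no repair takes place during the 5-year service life; real replicas can suffer correlated failures, so the loss figures are idealized. The function names and the 100GB collection size are hypothetical.

```python
def all_replicas_fail(p_single: float, replicas: int) -> float:
    """Probability that every replica fails, ASSUMING independent failures."""
    return p_single ** replicas


def storage_cost(price_per_gb: float, gigabytes: float, replicas: int) -> float:
    """Raw disk cost of holding the given number of full replicas."""
    return price_per_gb * gigabytes * replicas


COLLECTION_GB = 100  # hypothetical collection size

# Two consumer-grade Barracuda replicas (7% failure each, $0.57/GB).
barracuda_loss = all_replicas_fail(0.07, 2)
barracuda_cost = storage_cost(0.57, COLLECTION_GB, 2)

# One enterprise-grade Cheetah replica (3% failure, $8.20/GB).
cheetah_loss = all_replicas_fail(0.03, 1)
cheetah_cost = storage_cost(8.20, COLLECTION_GB, 1)
```

Under these assumptions two consumer replicas have a combined loss probability of about 0.5%, against 3% for the single enterprise drive, at roughly one seventh of the disk cost.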

In addition to the severe failures predicted by
the MTTF specifications, drives specify a rate of
unrecoverable bit errors, 10^-14 for the Barracuda and
10^-15 for the Cheetah. This is a very low probability,
but the disks contain over 10^12 bits. About one
in every 62 attempts to read every bit from a Bar- 
racuda will encounter an unrecoverable bit error; the 
corresponding figure for the Cheetah is about 1 in 
860. The disks also transfer data very fast. Even if 
the drive averages 99% idle, over a 5-year service life 
the Barracuda will suffer about 8 and the Cheetah 
about 6 unrecoverable bit errors. 
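
The "one in 62" and "one in 860" figures follow directly from the drive capacities and the specified bit error rates. The short calculation below reproduces them; it assumes decimal gigabytes (10^9 bytes) and treats the expected number of errors per full read as the per-read probability, which is a reasonable approximation when that number is small. The function name is hypothetical.

```python
def full_reads_per_error(capacity_gb: float, bit_error_rate: float) -> float:
    """Expected number of whole-disk reads per unrecoverable bit error."""
    bits = capacity_gb * 1e9 * 8  # decimal gigabytes, 8 bits per byte
    expected_errors_per_read = bits * bit_error_rate
    return 1.0 / expected_errors_per_read


barracuda = full_reads_per_error(200, 1e-14)  # about 62.5
cheetah = full_reads_per_error(146, 1e-15)    # about 856
```

The figures for errors over a 5-year service life additionally depend on the drives' transfer rates, which are not given here, so they are not reproduced.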

Because the in-service failure probability even 
for expensive drives is so high, enterprise storage
systems use replication techniques such as RAID [46].
These "internal" replicas are costly but of little value
in digital preservation. They provide high availability,
but spending heavily to improve availability
is hard to justify for systems such as dark archives 
where the probability of a user access during the re- 
covery time from a disk failure is low. They improve 
the reliability of the data, but not enough to justify 
their cost. The replicas are tightly coupled to each 
other and are thus subject to many correlated failure 
modes [HUSH 

Another reason why digital preservation sys- 
tems might not want to use enterprise-grade hard- 
ware is the cost of power and cooling, which can be 
substantial over the long lifetime of the system. En- 
terprise hardware has to meet exacting performance 
targets and typically does so by using power extrav- 
agantly. Preservation systems have much lower per- 
formance targets and can save power both by using 
consumer-grade hardware and by under-clocking it. 
The Internet Archive has led the way in engineering 
low-power storage systems in this way, spinning off 
a company called Capricorn Technologies to build 
them [0]. 

Operation 

4 Prices from TigerDirect.com 6/13/05 
As with any activity involving humans, system
administration is expensive and error-prone. Yet 
digital preservation requires very low rates of system 
administration error over very long periods of time. 
The obvious technique is to assign each replica to its 
own administrative domain, so that a single admin- 
istrative error can affect at most one replica. In a 
peer-to-peer system, such as LOCKSS, this is nat- 
urally the case; other distributed architectures may 
require more costly measures to achieve separate ad- 
ministrative control of each replica. 

Attempts are sometimes made to reduce the 
visible cost of system administration by running the 
digital preservation system as one of a large num- 
ber of services offered by a large shared server, or 
as one of a large number of services sharing a stor- 
age infrastructure such as the Storage Resource Bro- 
ker [S7j. This is often a false economy. Layering 
systems in this way adds significant complexity and 
introduces many failure modes, including hardware, 
software, network, operational and administrative 
failures, that are absent or much less significant in 
dedicated systems. These add greatly to the risks to 
the stored content. In particular, it is impossible to
prevent errors in other systems, which share the
infrastructure but are unrelated to digital preservation,
from damaging preserved content. Machine and
administrative boundaries can be very effective at
preventing faults from propagating.

The only approach to reducing operational 
costs while maintaining low rates of operator error 
is to eliminate, as far as possible, the system's need 
for operator intervention. The large number of repli- 
cas envisaged for the LOCKSS system forced it to 
adopt this "network appliance" approach [53], which
has been successful in making the per-replica cost of 
administration affordable. 

Format Migration 

Format migration involves both engineering 
costs, in implementing the necessary format con- 
verters, and operational costs, in applying them to 
the preserved content. The engineering costs will be 
equivalent whatever approach is taken, but the oper- 
ational costs will vary. The operational cost of batch 
migration may be large and will be incurred at un- 
predictable intervals, making it difficult to budget. 
This raises the specter of economic triage, discarding 
material whose migration cost exceeds its perceived 
value. The operational costs of the LOCKSS approach
of transparent on-access migration are minimal [53].

4.6.3 Economy in Dissemination 

The cost of disseminating the content has two 
components: the cost of complying with any access
restrictions imposed by the agreement under which
the content is being preserved (see Section 4.6.1),
and the cost of actually supplying copies to autho- 
rized readers. 

Complying with the access restrictions typically
involves an authentication system; Shibboleth [15]
is a current example. This is a system of
comparable complexity to the digital preservation 
system itself which must be adopted, maintained, 
audited and replaced with a newer system as it be- 
comes obsolete. There are administrative costs in- 
volved too, as users are introduced to and removed 
from the system, and as the publishers with whom 
the agreements were made need reassurance that 
they are being observed. 

Actual dissemination costs such as the cost of 
operating a web server and the bandwidth it uses are 
likely to be relatively low, given the archival nature 
of the preserved content. Content that is expected 
to be popular, such as the UK census data jSH], will 
typically be disseminated from the preservation sys- 
tem once to form a temporary access copy on an 
industrial-strength web server. 

4.7 Sloth 

Digital preservation is almost unique among 
computer applications in that speed is neither a goal 
nor even an advantage. There is normally no hurry 
to ingest content, and no large group of readers
impatient for it to be disseminated (see Section 4.5).
As described above, the lack of a need for speed can 
be leveraged to reduce the cost of hardware, power 
and cooling. It can also reduce the cost of system ad- 
ministration by increasing the window during which 
administrator response is required. Tasks that can 
be scheduled flexibly and well in advance are much 
cheaper than those requiring instant action. But the 
most important reason for sloth is that a system that 
operates fast will tend to fail fast, especially under 
attack |7fil Ifirij . Slow failure, with plenty of warning 
during the gradual failure, is an important attribute 
of digital preservation systems, as it allows time for 
recovery policies to be implemented before failure is 
total. 

The LOCKSS system is an example of the 
sloth strategy. Its design principle of running no
faster than necessary was sparked by an early talk 
by Stewart Brand and Danny Hillis about how the 
same principle applied to the design of the "Clock 
of the Long Now" j^] . The principle is implemented 
by rate-limiters, which apply among other things to 
collecting content by crawling the publisher's web 
site, so as not to compete with actual readers, and 
the audit mechanism, so as to prevent an attacker 
changing many replicas in a short period. 
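
A rate-limiter of the kind described can be very simple. The sketch below enforces a minimum interval between events; it is an illustration of the general technique only, not the LOCKSS implementation, and the class name, the method name and the 60-second interval are assumptions made for the example. The clock is injectable so that the behaviour can be demonstrated without waiting.

```python
import time


class RateLimiter:
    """Allow at most one event per min_interval seconds."""

    def __init__(self, min_interval: float, clock=time.monotonic):
        self.min_interval = min_interval
        self.clock = clock
        self.last = None  # time of the most recent permitted event

    def try_acquire(self) -> bool:
        """Return True and record the event if enough time has passed."""
        now = self.clock()
        if self.last is not None and now - self.last < self.min_interval:
            return False
        self.last = now
        return True


# Demonstration with a fake clock (hypothetical values, in seconds).
fake_now = [0.0]
limiter = RateLimiter(60.0, clock=lambda: fake_now[0])
first = limiter.try_acquire()   # permitted: no earlier event
second = limiter.try_acquire()  # refused: no time has passed
fake_now[0] = 61.0
third = limiter.try_acquire()   # permitted: more than 60 seconds later
```

Applied to a crawler, each page fetch would be gated by `try_acquire`; applied to an audit mechanism, the same gate bounds how quickly any party, including an attacker, can cause replicas to be changed.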

5 Requirements 

Digital preservation systems have a simple 
goal, that the information they contain remain ac- 
cessible to users over a long period of time. In ad- 
dressing this goal they are subject to a wide range of 
threats, not all of which are relevant to all systems. 
We have also shown a wide range of strategies, each 
of which is used by at least one current system. But 
the various systems use various techniques to imple- 
ment each strategy. 

The failure of a digital preservation system will 
become evident in finite time, but its success will for- 
ever remain unproven. Given this, and the diversity 
of threats and strategies, it seems premature to be 
imposing requirements in terms of particular techni- 
cal approaches. Rather, systems should be required 
to disclose their solutions to the various threats, and 
other aspects of the strategies they are pursuing. 
This will allow certification against a checklist of re- 
quired disclosures, and allow customers to make in- 
formed decisions as to how their digital assets may 
most economically reach an adequate level of preser- 
vation against the threats they consider relevant. 

Here is the list of suggested disclosures our 
bottom-up process generated: 

1. Systems should have an explicit threat model, 
disclosing against which of the threats of Sec- 
tion they are attempting to preserve content, 
and how they are addressing each threat. 

2. Systems should disclose how their replicas are 
created and administered, and how any damage 
is detected and repaired. 

3. Systems should disclose the policies and mech- 
anisms they implement to protect intellectual 
property. Specifically: 

• If a system is intended to hold only material
whose copyright belongs to the host institution,
it should disclose how it assures
that this is in fact the case.

• If a system is intended to hold mate- 
rial whose copyright belongs to others, 
it should disclose information about the 
agreement under which it is held, such as 
whether and under what terms the agree- 
ment can be revoked by the copyright 
holder, and how the permission granted 
is verified, recorded as metadata and pre- 
served. 

• If a system is intended to hold material not 
covered by copyright, such as US govern- 
ment documents within the US, it should 
disclose how it assures that this is verified, 
recorded as metadata and preserved. 

4. Systems should disclose their external inter- 
faces, in particular their SIP and DIP specifi- 
cations. They should disclose whether, to assist 
external auditing, they are capable of disgorg- 
ing a DIP identical to the SIP that caused the 
content in question to be stored, including not 
just the content but also all the metadata orig- 
inally provided (and none of the metadata that 
it subsequently acquired). 

5. Systems should disclose their source code ac- 
cess policy, and how their source code is to be 
preserved. 

6. Systems should disclose who will conduct au- 
dits, how they will be conducted, and to whom 
the results will be provided. 

7. Systems should disclose their policy for han- 
dling incidents of data loss. To whom are such 
incidents reported and in what form (See Sec- 
tion OJ? 

The work underway to add certification re- 
quirements to OAIS is proceeding along similar lines, 
but from a top-down perspective [51]. We note that,
while there are strong relationships between the cri- 
teria in the current draft of these requirements and 
our suggested disclosures, there are very few exact 
correspondences.

We hope this list will help the process of com- 
ing to consensus on a set of requirements for sys- 
tems to be certified under the OAIS standard. We 
also hope that it will assist system designers and au- 
thors of papers about systems by providing a check- 
list of topics which they have surely considered, but 
which they may have considered too obvious to
document [52 EH 023 E].